ABSTRACT

Speech denoising is a long-standing problem in the audio processing community. Many denoising
algorithms exist for stationary noise; for example, the Wiener filter is suitable for additive Gaussian
noise. When the noise is non-stationary, however, classical denoising algorithms usually perform poorly
because the statistics of non-stationary noise are difficult to estimate.
I. INTRODUCTION

Schmidt [14] uses non-negative matrix factorization (NMF) for speech denoising under non-stationary noise,
an approach that is completely different from classical statistical approaches. The key idea is that a clean
speech signal can be sparsely represented by a speech dictionary, but non-stationary noise cannot. Similarly,
non-stationary noise can be sparsely represented by a noise dictionary, but speech cannot. The algorithm for
NMF denoising goes as follows. Two dictionaries, one for speech and one for noise, are trained offline. Once a
noisy speech signal is given, we first calculate the magnitude of its short-time Fourier transform. Second, we
separate it into two parts via NMF: one that can be sparsely represented by the speech dictionary and another
that can be sparsely represented by the noise dictionary. Third, the part represented by the speech dictionary
is taken as the estimated clean speech.
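
The three steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the dictionaries `W_speech` and `W_noise` are hypothetical pre-trained non-negative matrices, the activations are found with the standard multiplicative update for the Euclidean cost, and no explicit sparsity penalty is included, so it is simpler than the sparse formulation described in the text.

```python
import numpy as np

def nmf_denoise(V, W_speech, W_noise, n_iter=200, eps=1e-12, seed=0):
    """Split a magnitude spectrogram V (freq x frames) using fixed dictionaries.

    W_speech and W_noise are non-negative dictionaries assumed to have been
    trained offline; only the activations H are estimated here.
    """
    W = np.hstack([W_speech, W_noise])              # combined dictionary
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps  # non-negative start
    for _ in range(n_iter):
        # Multiplicative update for the cost ||V - WH||^2 with W held fixed
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    k = W_speech.shape[1]
    speech_mag = W_speech @ H[:k]                   # part explained by speech atoms
    noise_mag = W_noise @ H[k:]                     # part explained by noise atoms
    return speech_mag, noise_mag
```

In a full system the returned `speech_mag` would be combined with the noisy phase and inverted back to a waveform; that step is omitted here.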

This paper investigates the enhancement of speech by applying MMSE short-time spectral magnitude estimation
in the modulation domain. For this purpose, the traditional analysis-modification-synthesis framework is
extended to include modulation domain processing. We compensate the noisy modulation spectrum for additive
noise distortion by applying the MMSE short-time spectral magnitude estimation algorithm in the modulation
domain. A number of subjective experiments were conducted. Initially, we determine the parameter values that
maximise the subjective quality of stimuli enhanced using the MMSE modulation magnitude estimator. Next, we
compare the quality of stimuli processed by the MMSE modulation magnitude estimator to those processed using
the MMSE acoustic magnitude estimator and the modulation spectral subtraction method, and show that a good
improvement in speech quality is achieved through use of the proposed approach. Then we evaluate the effect of
including speech presence uncertainty and log-domain processing on the quality of enhanced speech, and find
that the method works better with speech presence uncertainty. Finally, we compare the quality of speech
enhanced using the MMSE modulation magnitude estimator (when used with speech presence uncertainty) with that
enhanced using different acoustic domain MMSE magnitude estimator formulations, and with that enhanced using
different modulation domain based enhancement algorithms. Results of these tests show that the MMSE modulation
magnitude estimator improves the quality of processed stimuli without introducing musical noise or spectral
smearing distortion. The proposed method is shown to have better noise suppression than MMSE acoustic
magnitude estimation, and improved speech quality compared to the other modulation domain based enhancement
methods considered.

Speech enhancement methods aim to improve the quality of noisy speech by reducing noise, while at the same
time minimising any speech distortion introduced by the enhancement process. Many enhancement methods are
based on the short-time Fourier analysis-modification-synthesis framework.
Some examples of these are the spectral subtraction method (Boll, 1979), the Wiener filter method
(Wiener, 1949), and the MMSE short-time spectral amplitude estimation method (Ephraim and Malah, 1984).
Spectral subtraction is perhaps one of the earliest and most extensively studied methods for speech
enhancement. This simple method enhances speech by subtracting a spectral estimate of noise from the noisy
speech spectrum in either the magnitude or energy domain. Though this method is effective at reducing noise, it
suffers from the problem of musical noise distortion, which is very annoying to listeners. To overcome this
problem, Ephraim and Malah in 1984 proposed the MMSE short-time spectral amplitude estimator, referred to
throughout this work as the acoustic magnitude estimator (AME). In the literature (e.g., Cappé, 1994; Scalart
and Filho, 1996), it has been suggested that the good performance of the AME can be largely attributed to the
use of the decision-directed approach for estimation of the a priori signal-to-noise ratio (a priori SNR). The
AME method, even today, remains one of the most effective and popular methods for speech enhancement.

Recently, the modulation domain has become popular for speech processing. This has been in part due to the
strong psychoacoustic and physiological evidence that supports the significance of the modulation domain for
the analysis of speech signals. Zadeh (1950) was perhaps the first to propose a two-dimensional bi-frequency
system, where the second dimension for frequency analysis was the transform of the time variation of the
magnitudes at each standard (acoustic) frequency. More recently, Atlas et al. (2004) define the acoustic
frequency as the axis of the first short-time Fourier transform (STFT) of the input signal and the modulation
frequency as the independent variable of the second STFT transform.
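
The two-stage definition above (a first STFT over the signal, then a second STFT over the time trajectory of each acoustic magnitude) can be illustrated with a short sketch. The function names, the Hann window and all frame lengths and hops below are illustrative assumptions, not values taken from the cited works.

```python
import numpy as np

def stft(x, frame_len, hop):
    """Short-time Fourier transform of a 1-D array using a Hann window.

    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    frames = np.array([x[s:s + frame_len] * window for s in starts])
    return np.fft.rfft(frames, axis=1)

def modulation_spectrum(x, ac_len=256, ac_hop=64, mod_len=32, mod_hop=4):
    """Second STFT taken along time over each acoustic-frequency magnitude track."""
    acoustic_mag = np.abs(stft(x, ac_len, ac_hop))   # (time, acoustic freq)
    # One STFT per acoustic frequency bin, over its magnitude trajectory
    tracks = [stft(acoustic_mag[:, k], mod_len, mod_hop)
              for k in range(acoustic_mag.shape[1])]
    return np.stack(tracks, axis=1)  # (mod. frame, acoustic freq, mod. freq)
```

The result is indexed by time (modulation frame), acoustic frequency and modulation frequency, which matches the three variables of the short-time modulation spectrum. The input must be a NumPy array long enough to fill at least one modulation frame.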

Early efforts to utilise the modulation domain for speech enhancement assumed speech and noise to be
stationary, and applied fixed filtering on the trajectories of the acoustic magnitude spectrum. For example,
Hermansky et al. (1995) proposed band-pass filtering the time trajectories of the cubic-root compressed
short-time power spectrum to enhance speech. Falk et al. (2007) and Lyons and Paliwal (2008) applied similar
band-pass filtering to the time trajectories of the short-time magnitude (power) spectrum for speech
enhancement.

However, speech and possibly noise are known to be nonstationary. To capture this nonstationarity, one option
is to assume speech to be quasi-stationary, and process the trajectories of the acoustic magnitude spectrum on
a short-time basis. At this point it is useful to differentiate the acoustic spectrum from the modulation
spectrum as follows. The acoustic spectrum is the STFT of the speech signal, while the modulation spectrum at a
given acoustic frequency is the STFT of the time series of the acoustic spectral magnitudes at that frequency.
The short-time modulation spectrum is thus a function of time, acoustic frequency and modulation frequency.
This type of short-time processing in the modulation domain has been used in the past for automatic speech
recognition (ASR). Kingsbury et al. (1998), for example, applied a modulation spectrogram representation that
emphasized low-frequency amplitude modulations to ASR for improved robustness in noisy and reverberant
conditions. Tyagi et al. (2003) applied mel-cepstrum modulation features to ASR to give improved performance in
the presence of non-stationary noise. Short-time modulation domain processing has also been applied to
objective quality measurement. For example, Kim and Oct (2004, 2005) as well as Falk and Chan (2008) used the
short-time modulation magnitude spectrum to derive objective measures that characterise the quality of
processed speech. For speech enhancement, short-time modulation domain processing was recently applied in the
modulation spectral subtraction method (ModSSub) of Paliwal et al. (2010). Here, the spectral subtraction
method was extended to the modulation domain, enhancing speech by subtracting the noise modulation energy
spectrum from the noisy modulation energy spectrum in an analysis-modification-synthesis (AMS) framework. In
the ModSSub method, the frame duration used for computing the short-time modulation spectrum was found to be an
important parameter, providing a trade-off between quality and level of musical noise. Increasing the frame
duration reduced musical noise, but introduced a slurring distortion. A somewhat long frame duration of 256 ms
was recommended as a good compromise.
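
The subtraction step at the heart of spectral subtraction can be sketched as below. This is a generic energy-domain version that applies to magnitudes from either the acoustic or the modulation domain; it is not the exact ModSSub formulation, and the `floor` parameter is a hypothetical safeguard added here to keep the estimated energy non-negative.

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, floor=0.002):
    """Energy-domain spectral subtraction with a spectral floor.

    noisy_mag, noise_mag: spectral magnitudes of the same (or broadcastable)
    shape. `floor` is a hypothetical parameter: wherever the noise estimate
    exceeds the input energy, a small fraction of the input energy is kept.
    """
    noisy_energy = np.asarray(noisy_mag, dtype=float) ** 2
    noise_energy = np.asarray(noise_mag, dtype=float) ** 2
    clean_energy = np.maximum(noisy_energy - noise_energy, floor * noisy_energy)
    return np.sqrt(clean_energy)
```

Isolated bins that survive the subtraction while their neighbours are floored are one common source of the musical noise mentioned above.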

The disadvantages of using a longer modulation domain analysis window are as follows. Firstly, we are
assuming stationarity, which we know is not the case. Secondly, quite a long portion of the signal is needed
for the initial estimation of noise. Thirdly, as shown by Paliwal et al. (2011), speech quality and
intelligibility are higher when the modulation magnitude spectrum is processed using short frame durations and
lower when processed using longer frame durations. For these reasons, we aim to find a method better suited to
the use of shorter modulation analysis window durations. Since the AME method has been found to be more
effective than spectral subtraction in the acoustic domain, in this paper we explore the effectiveness of this
method in the short-time modulation domain. For this purpose, the traditional analysis-modification-synthesis
framework is extended to include modulation domain processing; the noisy modulation spectrum is then
compensated for additive noise distortion by applying the MMSE short-time spectral magnitude estimation
algorithm. The advantage of applying an MMSE-based method is that it does not introduce musical noise and hence
can be used with shorter frame durations in the modulation domain.
II. IMAGE COMPRESSION

The objective of image compression is to reduce irrelevance and redundancy of the image data in 
order to be able to store or transmit data in an efficient form. 




Figure 1: A chart showing the relative quality of various JPEG settings, comparing a file saved as a JPEG
normally with one saved using a "save for web" technique.

III. LOSSY AND LOSSLESS COMPRESSION

Image compression may be lossy or lossless. Lossless compression is preferred for archival purposes 
and often for medical imaging, technical drawings, clip art, or comics. This is because lossy compression 
methods, especially when used at low bit rates, introduce compression artifacts. Lossy methods are especially 
suitable for natural images such as photographs in applications where minor (sometimes imperceptible) loss of 
fidelity is acceptable to achieve a substantial reduction in bit rate. The lossy compression that produces 
imperceptible differences may be called visually lossless. 

Methods for lossless image compression are: 

• Run-length encoding - used as the default method in PCX and as one of the possible methods in BMP, TGA, and TIFF

• DPCM and Predictive Coding 

• Entropy encoding 

• Adaptive dictionary algorithms such as LZW - used in GIF and TIFF 

• DEFLATE - used in PNG, MNG, and TIFF

• Chain codes 
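
As a small illustration of the first method in the list, run-length encoding replaces each run of identical values with a value and a count. The sketch below shows the idea on a list of pixel values; it does not reproduce the byte layout of PCX or any other file format, and the function names are illustrative.

```python
def rle_encode(pixels):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([p, 1])       # start a new run
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out
```

Decoding the encoded sequence returns the input exactly, which is what makes the method lossless.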

Methods for lossy compression: 

• Reducing the color space to the most common colors in the image. The selected colors are specified in the
color palette in the header of the compressed image. Each pixel just references the index of a color in the
color palette. This method can be combined with dithering to avoid posterization.

• Chroma subsampling. This takes advantage of the fact that the human eye perceives spatial changes of 
brightness more sharply than those of color, by averaging or dropping some of the chrominance information 
in the image. 

• Transform coding. This is the most commonly used method. In particular, a Fourier-related transform such
as the Discrete Cosine Transform (DCT) is widely used: N. Ahmed, T. Natarajan and K. R. Rao, "Discrete
Cosine Transform," IEEE Trans. Computers, 90-93, Jan. 1974. The DCT is sometimes referred to as "DCT-II"
in the context of a family of discrete cosine transforms. The more recently developed wavelet transform is
also used extensively. In both cases the transform is followed by quantization and entropy coding.

• Fractal compression. 
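
To make the transform-coding item more concrete, the following sketch evaluates the unnormalised one-dimensional DCT-II directly from its definition. It is written for clarity rather than speed (practical codecs use fast, two-dimensional implementations), and the function name is illustrative.

```python
import math

def dct2(x):
    """Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 1/2) * k)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]
```

For a constant block such as `[1, 1, 1, 1]`, all of the energy lands in the first coefficient and the others vanish up to rounding error. This concentration of energy in a few coefficients is what the subsequent quantization and entropy coding stages exploit.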
IV. THE WAVELET TRANSFORM BASED CODING

The main difference between wavelet transform (WT) and discrete cosine transform (DCT) coding systems is
the omission of the transform coder's sub-image processing stages in WT. Because WTs are both computationally
efficient and inherently local (i.e. their basis functions are limited in duration), subdivision of the
original image is not required. The removal of the subdivision step eliminates the blocking artifact. Wavelet
coding techniques are based on the idea that the coefficients of a transform which de-correlates the pixels of
an image can be coded more efficiently than the original pixels themselves [9]. The computed transform converts
a large portion of the original image to horizontal, vertical and diagonal decomposition coefficients with zero
mean and a Laplacian-like distribution. The 9/7 tap biorthogonal filters [10], which produce floating point
wavelet coefficients, are widely used in MIC techniques to generate a wavelet transform [11,12,13]. The wavelet
coefficients are uniformly quantized by dividing by a user-specified parameter and rounding to the nearest
integer. Typically, a large majority of coefficients with small values are quantized to zero by this step. The
zeroes in the resulting sequence are run-length encoded, and Huffman and arithmetic coding are performed on
the resulting sequence. The various subband blocks of coefficients are coded separately, which improves the
overall compression [9]. If the quantization parameter is increased, more coefficients are quantized to zero,
the remaining ones are quantized more coarsely, the representation accuracy decreases, and the compression
ratio (CR) consequently increases. Since the input image needs to be divided into blocks in DCT-based
compression, correlation across the block boundaries is not eliminated. This results in 'blocking artifacts',
particularly at low bits per pixel (bpp). In WT coding, by contrast, there is no need to block the input image,
and its basis functions have variable length; hence wavelet schemes avoid blocking artifacts even at higher
CRs. The basic structure of the WT-based compression process is shown in Figure 2 below. Further details of the
wavelet transform can be found in [5, 14].
Figure 2: Basic Structure of WT based Compression
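
The uniform quantization step described in this section (divide by a user-specified parameter, round to the nearest integer) can be sketched in a few lines. The coefficient values in the usage note are made-up numbers for illustration only, and Python's built-in `round` resolves exact ties to the even integer, which may differ from a particular codec's rounding rule.

```python
def quantize(coeffs, q):
    """Uniform quantisation: divide by q and round to the nearest integer."""
    return [round(c / q) for c in coeffs]

def dequantize(indices, q):
    """Approximate reconstruction: scale the integer indices back by q."""
    return [i * q for i in indices]
```

With the hypothetical coefficients `[12.3, -0.4, 0.9, 3.7, -7.2, 0.1]`, a parameter of 1 leaves two zeros, 4 leaves three and 8 leaves four, so a larger parameter produces longer zero runs for the run-length coder at the cost of coarser reconstruction.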

4.1. The Discrete Wavelet Transform

Wavelet transforms are built on 'basis functions'. Unlike the Fourier transform, whose basis functions are
sinusoids, wavelet transforms are based on small waves, called 'wavelets', of varying frequency and limited
duration. Wavelets are the foundation of a powerful signal processing approach called Multi-Resolution
Analysis (MRA). As its name implies, multi-resolution theory is concerned with the representation and analysis
of signals (or images) at more than one resolution. Hence features that might go undetected at one resolution
may be easy to spot at another. Wavelet analysis is based on two important functions, namely the scaling
function and the wavelet function. Calculating wavelet coefficients at every possible scale is a fair amount
of work, and it generates a lot of data. It turns out, rather remarkably, that if only a subset of scales and
positions based on powers of two (so-called dyadic scales and positions) is chosen, the analysis is much more
efficient and just as accurate. If the function being expanded is a sequence of numbers, like samples of a
continuous function f(x), the resulting coefficients are called the discrete wavelet transform (DWT) of f(x).
The decomposition of a signal into high and low frequency components using the DWT is depicted in the block
diagram in Figure 3 [14].
Figure 3: Two-level decomposition in 2-D DWT: (a) original image, (b) level-one decomposition, (c) level-two decomposition




Figure 4: 2D discrete wavelet transform used in JPEG2000 

In the example of the 2D discrete wavelet transform used in JPEG2000 shown in Figure 4, the original image
is high-pass filtered, yielding the three large images, each describing local changes in brightness (details)
in the original image. It is then low-pass filtered and downscaled, yielding an approximation image; this image
is high-pass filtered to produce the three smaller detail images, and low-pass filtered to produce the final
approximation image in the upper-left.

In numerical analysis and functional analysis, a discrete wavelet transform (DWT) is any wavelet 
transform for which the wavelets are discretely sampled. As with other wavelet transforms, a key advantage it 
has over Fourier transforms is temporal resolution: it captures both frequency and location information (location 
in time). 

Examples

Haar wavelets

The first DWT was invented by the Hungarian mathematician Alfred Haar. For an input represented by a list of
$2^n$ numbers, the Haar wavelet transform may be considered to simply pair up input values, storing the
difference and passing the sum. This process is repeated recursively, pairing up the sums to provide the next
scale, finally resulting in $2^n - 1$ differences and one final sum.

Daubechies wavelets

The most commonly used set of discrete wavelet transforms was formulated by the Belgian mathematician Ingrid
Daubechies in 1988. This formulation is based on the use of recurrence relations to generate progressively
finer discrete samplings of an implicit mother wavelet function; each resolution is twice that of the previous
scale. In her seminal paper, Daubechies derives a family of wavelets, the first of which is the Haar wavelet.
Interest in this field has exploded since then, and many variations of Daubechies' original wavelets were
developed.
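
The pairing procedure described for the Haar wavelet can be written out directly. The sketch below is an unnormalised version (plain sums and differences, no scaling factors); the function name and the return layout are illustrative choices.

```python
def haar_transform(values):
    """Unnormalised Haar transform of a list whose length is a power of two.

    Returns (final_sum, details), where details[0] holds the coarsest
    differences and details[-1] the finest.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    sums, details = list(values), []
    while len(sums) > 1:
        pairs = list(zip(sums[0::2], sums[1::2]))
        details.insert(0, [a - b for a, b in pairs])  # store the differences
        sums = [a + b for a, b in pairs]              # pass the sums on
    return sums[0], details
```

For an input of 8 values the function returns one final sum and 4 + 2 + 1 = 7 differences, in line with the count given in the text.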

The Dual-Tree Complex Wavelet Transform (CWT)

The dual-tree complex wavelet transform (CWT) is a relatively recent enhancement to the discrete wavelet
transform (DWT), with important additional properties: it is nearly shift invariant and directionally selective
in two and higher dimensions. It achieves this with a redundancy factor of only $2^d$ for d-dimensional
signals, which is substantially lower than the undecimated DWT. The multidimensional (M-D) dual-tree CWT is
nonseparable but is based on a computationally efficient, separable filter bank (FB). [2]

Others

Other forms of discrete wavelet transform include the non- or undecimated wavelet transform (where
downsampling is omitted) and the Newland transform (where an orthonormal basis of wavelets is formed from
appropriately constructed top-hat filters in frequency space). Wavelet packet transforms are also related to
the discrete wavelet transform. The complex wavelet transform is another form.

Properties

The Haar DWT illustrates the desirable properties of wavelets in general. First, it can be performed in
$O(n)$ operations; second, it captures not only a notion of the frequency content of the input, by examining it
at different scales, but also temporal content, i.e. the times at which these frequencies occur. Combined,
these two properties make the fast wavelet transform (FWT) an alternative to the conventional fast Fourier
transform (FFT).

Time Issues

Due to the rate-change operators in the filter bank, the discrete WT is not time-invariant but actually very
sensitive to the alignment of the signal in time. To address the time-varying problem of wavelet transforms,
Mallat and Zhong proposed a new algorithm for wavelet representation of a signal, which is invariant to time
shifts. [3] According to this algorithm, which is called a TI-DWT, only the scale parameter is sampled along
the dyadic sequence $2^j$ ($j \in \mathbb{Z}$) and the wavelet transform is calculated for each point in
time. [4][5]

Applications 

The discrete wavelet transform has a huge number of applications in science, engineering, mathematics
and computer science. Most notably, it is used for signal coding, to represent a discrete signal in a more
redundant form, often as a preconditioning for data compression. Practical applications can also be found in
signal processing of accelerations for gait analysis, [6] in digital communications and many others. [7][8][9]
It has been shown that the discrete wavelet transform (discrete in scale and shift, and continuous in time) can
be successfully implemented as an analog filter bank in biomedical signal processing for the design of
low-power pacemakers and also in ultra-wideband (UWB) wireless communications. [10]

V. DATA INTERPRETATION AND ANALYSIS 

Comparison with Fourier transform 

To illustrate the differences and similarities between the discrete wavelet transform and the discrete
Fourier transform, consider the DWT and DFT of the following sequence: (1,0,0,0), a unit impulse.

The DFT has orthogonal basis (DFT matrix): 

$$\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ 1 & -1 & 1 & -1 \end{pmatrix}$$



while the DWT with Haar wavelets for length 4 data has orthogonal basis in the rows of: 

$$\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}$$

(To simplify notation, whole numbers are used, so the bases are orthogonal but not orthonormal.) 
Preliminary observations include: 

• Wavelets have location - the (1,1,-1,-1) wavelet corresponds to "left side" versus "right side", while the 
last two wavelets have support on the left side or the right side, and one is a translation of the other. 

• Sinusoidal waves do not have location - they spread across the whole space - but do have phase - the 
second and third waves are translations of each other, corresponding to being 90° out of phase, like cosine 
and sine, of which these are discrete versions. 

Decomposing the sequence with respect to these bases yields: 
$$(1,0,0,0) = \tfrac{1}{4}(1,1,1,1) + \tfrac{1}{4}(1,1,-1,-1) + \tfrac{1}{2}(1,-1,0,0) \quad \text{(Haar DWT)}$$

$$(1,0,0,0) = \tfrac{1}{4}(1,1,1,1) + \tfrac{1}{2}(1,0,-1,0) + \tfrac{1}{4}(1,-1,1,-1) \quad \text{(DFT)}$$

The DWT demonstrates the localization: the (1,1,1,1) term gives the average signal value, the (1,1,-1,-1)
term places the signal in the left side of the domain, and the (1,-1,0,0) term places it at the left side of
the left side. Truncating at any stage yields a downsampled version of the signal:

(1/4, 1/4, 1/4, 1/4)
(1/2, 1/2, 0, 0)          2-term truncation
(1, 0, 0, 0)




Figure 5: The sinc function, showing the time domain artifacts (undershoot and ringing) of truncating a Fourier series.

The DFT, by contrast, expresses the sequence by the interference of waves of various frequencies - 
thus truncating the series yields a low-pass filtered version of the series: 




(1/4, 1/4, 1/4, 1/4)
(3/4, 1/4, -1/4, 1/4)     2-term truncation
(1, 0, 0, 0)

Notably, the middle approximation (2-term) differs. From the frequency domain perspective, this is a
better approximation, but from the time domain perspective it has drawbacks: it exhibits undershoot (one of
the values is negative, though the original series is non-negative everywhere) and ringing, where the right
side is non-zero, unlike in the wavelet transform. On the other hand, the Fourier approximation correctly shows
a peak, and all points are within 1/4 of their correct value, though all points have error.

The wavelet approximation, by contrast, places a peak on the left half, but has no peak at the first point,
and while it is exactly correct for half the values (reflecting location), it has an error of 1/2 for the other
values. This illustrates the kinds of trade-offs between these transforms, and how in some respects the DWT
provides preferable behavior, particularly for the modeling of transients.
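
The comparison in this section can be checked numerically. The sketch below projects the unit impulse onto the two orthogonal bases (the Haar basis and the real cosine/sine form of the length-4 DFT basis, both written out explicitly in the code as assumptions of this example) and rebuilds the signal from its first few non-zero components, using exact fractions so that the results are not blurred by rounding.

```python
from fractions import Fraction as F

HAAR = [(1, 1, 1, 1), (1, 1, -1, -1), (1, -1, 0, 0), (0, 0, 1, -1)]
DFT_REAL = [(1, 1, 1, 1), (1, 0, -1, 0), (0, 1, 0, -1), (1, -1, 1, -1)]

def coefficients(signal, basis):
    """Project onto an orthogonal (not orthonormal) basis: <s, b> / <b, b>."""
    return [F(sum(s * b for s, b in zip(signal, vec)),
              sum(b * b for b in vec)) for vec in basis]

def truncate(signal, basis, terms):
    """Rebuild the signal from its first `terms` non-zero basis components."""
    parts = [(c, vec) for c, vec in zip(coefficients(signal, basis), basis) if c]
    out = [F(0)] * len(signal)
    for c, vec in parts[:terms]:
        out = [o + c * v for o, v in zip(out, vec)]
    return out
```

Calling `truncate((1, 0, 0, 0), HAAR, 2)` gives (1/2, 1/2, 0, 0), while `truncate((1, 0, 0, 0), DFT_REAL, 2)` gives (3/4, 1/4, -1/4, 1/4): the Fourier version has a negative value and a non-zero right half, whereas the wavelet version is exact on the right half and off by 1/2 on the left.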
VI. RESULTS & DISCUSSION

4.1. One level of the transform

The DWT of a signal x is calculated by passing it through a series of filters. First the samples are
passed through a low-pass filter with impulse response g, resulting in a convolution of the two:

$$y[n] = (x * g)[n] = \sum_{k=-\infty}^{\infty} x[k]\, g[n-k].$$

The signal is also decomposed simultaneously using a high-pass filter h. The outputs give the detail
coefficients (from the high-pass filter) and the approximation coefficients (from the low-pass filter). It is
important that the two filters are related to each other; together they are known as a quadrature mirror
filter.

However, since half the frequencies of the signal have now been removed, half the samples can be
discarded according to Nyquist's rule. The filter outputs are then subsampled by 2 (Mallat's and the common
notation is the opposite, g for the high-pass and h for the low-pass filter):

$$y_{\text{low}}[n] = \sum_{k=-\infty}^{\infty} x[k]\, g[2n-k]$$

$$y_{\text{high}}[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[2n-k]$$

This decomposition has halved the time resolution since only half of each filter output characterises the 
signal. However, each output has half the frequency band of the input so the frequency resolution has been 
doubled. 
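
The subsampled convolution above can be evaluated literally, computing only the even-indexed outputs. In the sketch below the signal is treated as zero outside its support, so the output is slightly longer than half the input; the function name and the averaging/differencing taps used in the usage note are illustrative assumptions.

```python
def analysis_step(x, taps):
    """y[n] = sum_k x[k] * taps[2n - k]: convolve, then keep every 2nd sample."""
    full_len = len(x) + len(taps) - 1
    out = []
    for m in range(0, full_len, 2):          # m = 2n, odd outputs never computed
        acc = 0.0
        for k in range(len(x)):
            if 0 <= m - k < len(taps):
                acc += x[k] * taps[m - k]
        out.append(acc)
    return out
```

With `x = [4, 6, 10, 12]`, the taps `[0.5, 0.5]` act as a low-pass filter g and give `[2.0, 8.0, 6.0]`, while `[0.5, -0.5]` act as a high-pass filter h and give `[2.0, 2.0, -6.0]`. Because the odd-indexed convolution outputs are skipped rather than computed and discarded, the loop does roughly half the work of a full convolution.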



Figure 6: Block diagram of filter analysis (the input x[n] is filtered by g[n] and by h[n], each followed by downsampling by 2, giving the approximation coefficients and the detail coefficients respectively)
With the subsampling operator $\downarrow$,

$$(y \downarrow k)[n] = y[kn],$$

the above summation can be written more concisely:

$$y_{\text{low}} = (x * g) \downarrow 2, \qquad y_{\text{high}} = (x * h) \downarrow 2.$$



However, computing a complete convolution $x * g$ with subsequent downsampling would waste
computation time.

The lifting scheme is an optimization in which these two computations are interleaved.

Cascading and Filter banks 

This decomposition is repeated to further increase the frequency resolution: the approximation
coefficients are decomposed with high- and low-pass filters and then down-sampled. This is represented as a
binary tree with nodes representing a sub-space with a different time-frequency localisation. The tree is known
as a filter bank.
Figure 7: A 3 level filter bank

In the case of a 3 level filter bank, at each level in the above diagram the signal is decomposed into low
and high frequencies. Due to the decomposition process, the input signal length must be a multiple of $2^n$,
where $n$ is the number of levels.

For example, for a signal with 32 samples, frequency range 0 to $f_n$ and 3 levels of decomposition, 4 output
scales are produced:

Level   Frequencies           Samples
3       0 to f_n/8            4
3       f_n/8 to f_n/4        4
2       f_n/4 to f_n/2        8
1       f_n/2 to f_n          16


Figure 8: Frequency domain representation of the DWT (bands labelled Level 3, Level 2 and Level 1 along the frequency axis)



Other transforms 

The Adam7 algorithm, used for interlacing in the Portable Network Graphics (PNG) format, is a 
multiscale model of the data which is similar to a DWT with Haar wavelets. Unlike the DWT, it has a specific 
scale - it starts from an 8x8 block, and it downsamples the image, rather than decimating (low-pass filtering, 
then downsampling). It thus offers worse frequency behavior, showing artifacts (pixelation) at the early stages, 
in return for simpler implementation. 